Inter-warp Instruction Temporal Locality in Deep-Multithreaded GPUs

Authors

  • Ahmad Lashgar
  • Amirali Baniasadi
  • Ahmad Khonsari
Abstract

GPUs employ thousands of threads per core to achieve high throughput. These threads exhibit locality in control flow, instruction and data addresses, and values. In this study we investigate inter-warp instruction temporal locality and show that, during short intervals, a significant share of instructions is fetched unnecessarily. This observation opens several opportunities for enhancing GPUs. We discuss different possibilities and evaluate a filter cache as a case study. Moreover, we investigate how variations in microarchitectural parameters impact the potential filter cache benefits in GPUs.
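To make the observation concrete, the sketch below (a minimal Python model, not the simulator used in the paper) replays a made-up interleaved per-warp PC trace through a small direct-mapped filter cache placed in front of the instruction cache and reports how many fetches the filter cache absorbs. The FilterCache class, the trace shape, and all sizes are illustrative assumptions.

```python
# Minimal sketch (not the paper's simulator): a tiny direct-mapped filter
# cache in front of the instruction cache, fed by an interleaved per-warp
# PC trace, to illustrate inter-warp instruction temporal locality.

class FilterCache:
    def __init__(self, num_entries=16, line_bytes=16):
        self.num_entries = num_entries
        self.line_bytes = line_bytes
        self.tags = [None] * num_entries   # one tag per direct-mapped entry
        self.hits = 0
        self.accesses = 0

    def fetch(self, pc):
        """Look up the instruction line holding `pc`; return True on a hit."""
        self.accesses += 1
        line = pc // self.line_bytes
        idx = line % self.num_entries
        if self.tags[idx] == line:
            self.hits += 1
            return True
        self.tags[idx] = line              # miss: fill from the I-cache
        return False


def interleaved_trace(num_warps=8, insts_per_warp=64, inst_bytes=8):
    """Hypothetical trace: warps scheduled round-robin over the same kernel,
    so the same PCs recur across warps within short intervals."""
    for i in range(insts_per_warp):
        for _ in range(num_warps):
            yield i * inst_bytes           # every warp fetches the same PC


if __name__ == "__main__":
    fc = FilterCache()
    for pc in interleaved_trace():
        fc.fetch(pc)
    print(f"filter-cache hit rate: {fc.hits / fc.accesses:.2%}")
```

With warps marching through the same static code, even this tiny structure captures most fetches, which is the kind of redundancy the study quantifies.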

Similar Articles

Optimizing Stencil Computations for NVIDIA Kepler GPUs

We present a series of optimization techniques for stencil computations on NVIDIA Kepler GPUs. Stencil computations on regular grids have been ported to older generations of NVIDIA GPUs with significant performance improvements, thanks to their higher memory bandwidth compared with conventional CPU-only systems. However, because of the architectural changes introduced with the latest generation of the...
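For readers unfamiliar with the term, a regular-grid stencil computation repeatedly updates each grid point from its neighbours. The NumPy sketch below shows a plain 5-point Jacobi sweep; it is only an illustration and reproduces none of the Kepler-specific optimizations from the cited work.

```python
import numpy as np

def jacobi_step(grid):
    """One 5-point Jacobi stencil sweep over the interior of a 2D grid.
    Each interior point becomes the average of its four neighbours."""
    out = grid.copy()
    out[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                              grid[1:-1, :-2] + grid[1:-1, 2:])
    return out

if __name__ == "__main__":
    g = np.zeros((128, 128))
    g[0, :] = 1.0                 # fixed boundary condition on one edge
    for _ in range(100):
        g = jacobi_step(g)
    print(g[1:4, 1:4])            # interior values near the hot boundary
```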

Effect of Instruction Fetch and Memory Scheduling on GPU Performance

GPUs are massively multithreaded architectures designed to exploit data level parallelism in applications. Instruction fetch and memory system are two key components in the design of a GPU. In this paper we study the effect of fetch policy and memory system on the performance of a GPU kernel. We vary the fetch and memory scheduling policies and analyze the performance of GPU kernels. As part of...
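As a rough illustration of what a fetch policy means in this context, the toy sketch below models two commonly compared warp selection policies, loose round-robin and greedy-then-oldest. It is an illustrative assumption of mine, not the cited paper's simulation framework.

```python
# Toy sketch of two warp fetch/issue policies often compared in GPU studies:
# loose round-robin (LRR) and greedy-then-oldest (GTO). Illustrative only.

def round_robin(ready, last):
    """Pick the next ready warp after `last`, wrapping around."""
    n = len(ready)
    for i in range(1, n + 1):
        w = (last + i) % n
        if ready[w]:
            return w
    return None

def greedy_then_oldest(ready, last):
    """Keep issuing from the same warp while it is ready; otherwise fall
    back to the oldest (lowest-numbered) ready warp."""
    if ready[last]:
        return last
    for w, ok in enumerate(ready):
        if ok:
            return w
    return None

if __name__ == "__main__":
    ready = [True, False, True, True]
    print(round_robin(ready, last=0))        # -> 2
    print(greedy_then_oldest(ready, last=0)) # -> 0 (warp 0 still ready)
```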

Building Multithreaded Architectures with Off-the-Shelf Microprocessors

Current strategies for supporting high-performance parallel computing often face the problem of large software overheads for process switching and interprocessor communication. This document presents the design of the Multi-Threaded Architecture (MTA) model, a multiprocessor architecture designed for the efficient parallel execution of both numerical and non-numerical programs. The basic MTA desi...

Dynamic Warp Formation: Exploiting Thread Scheduling for Efficient MIMD Control Flow on SIMD Graphics Hardware

Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together ...
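The core idea, regrouping scalar threads that have reached the same PC into new, fuller warps, can be sketched as follows. This toy regrouping function illustrates the concept only; it is not the hardware mechanism proposed in the cited paper, and the warp width is an assumption.

```python
from collections import defaultdict

WARP_SIZE = 4  # toy warp width for illustration

def form_warps(thread_pcs):
    """Group threads that are at the same PC into full warps where possible,
    so fewer SIMD lanes are wasted after a divergent branch."""
    by_pc = defaultdict(list)
    for tid, pc in thread_pcs.items():
        by_pc[pc].append(tid)

    warps = []
    for pc, tids in sorted(by_pc.items()):
        for i in range(0, len(tids), WARP_SIZE):
            warps.append((pc, tids[i:i + WARP_SIZE]))
    return warps

if __name__ == "__main__":
    # Two original warps diverged at a branch: even threads went to PC 0x20,
    # odd threads to PC 0x40. Regrouping yields two full warps instead of
    # four half-empty ones.
    pcs = {tid: 0x20 if tid % 2 == 0 else 0x40 for tid in range(8)}
    for pc, tids in form_warps(pcs):
        print(hex(pc), tids)
```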

CTA-aware Prefetching for GPGPU

Several studies have proposed adopting memory prefetching schemes to reduce the performance impact of long-latency memory operations in GPUs. Leveraging the simple intuition that consecutive warps are likely to exhibit spatial locality, prior approaches prefetch two or four consecutive cache lines when a cache miss occurs. Other approaches predict strided accesses by detecting base addres...
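The prefetch-a-few-consecutive-lines-on-a-miss intuition can be written down in a few lines. The sketch below is illustrative only and does not implement the CTA-aware scheme the cited paper proposes; the line size and prefetch degree are assumptions.

```python
# Sketch of the simple next-line prefetching intuition described above:
# on a demand miss, also fetch the next N consecutive cache lines.
# Illustrative only; not the CTA-aware scheme of the cited paper.

LINE_BYTES = 128      # assumed GPU cache line size
PREFETCH_DEGREE = 2   # fetch 2 sequential lines beyond the missing one

def lines_to_fetch(miss_addr, degree=PREFETCH_DEGREE, line_bytes=LINE_BYTES):
    """Return the demanded line plus `degree` sequential prefetch lines."""
    base = (miss_addr // line_bytes) * line_bytes
    return [base + i * line_bytes for i in range(degree + 1)]

if __name__ == "__main__":
    print([hex(a) for a in lines_to_fetch(0x1004)])
    # -> ['0x1000', '0x1080', '0x1100']
```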

Publication year: 2013